Shaham T R, Dekel T, Michaeli T. SinGAN: Learning a generative model from a single natural image[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 4570-4580.
1. Overview
This paper proposes SinGAN, an unconditional generative model learned from a single natural image.
1) A pyramid of fully convolutional GANs, each learning the patch distribution at a different scale of the image.
2) Generates new samples of arbitrary size and aspect ratio.
3) Apply to different tasks.
2. Method
2.1. Formulation
1) $\tilde{x}_N = G_N(z_N)$
2) $\tilde{x}_n = G_n(z_n, (\tilde{x}_{n+1}) \uparrow^r) = (\tilde{x}_{n+1}) \uparrow^r + \psi_n(z_n + (\tilde{x}_{n+1}) \uparrow^r), \quad n \lt N$
3) $\psi_n$ consists of 5 conv blocks of the form Conv(3×3)-BatchNorm-LeakyReLU.
4) Start with 32 kernels per block at the coarsest scale and increase this number by a factor of 2 every 4 scales.
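The coarse-to-fine recursion and the kernel schedule above can be sketched in plain Python. This is only an illustration under stated assumptions: `psi` is a hypothetical stand-in for the learned conv stack $\psi_n$, `upsample` uses nearest-neighbour resizing rather than whatever interpolation the authors use, and the coarsest generator is assumed to be $\psi_N$ applied to noise alone.

```python
import math
import random

def kernels_per_block(scales_above_coarsest, base=32):
    """Kernel count per block: `base` at the coarsest scale, doubled every 4 scales."""
    return base * 2 ** (scales_above_coarsest // 4)

def upsample(x, height, width):
    """Nearest-neighbour resize of a 2-D list to (height, width); a placeholder for the up-arrow operator."""
    h, w = len(x), len(x[0])
    return [[x[i * h // height][j * w // width] for j in range(width)]
            for i in range(height)]

def psi(v):
    """Hypothetical stand-in for the learned network psi_n (not the real conv stack)."""
    return [[0.1 * math.tanh(p) for p in row] for row in v]

def noise(height, width, rng):
    """Gaussian noise map z_n of the given size."""
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]

def add(a, b):
    """Element-wise sum of two equally sized 2-D lists."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def sample(sizes, rng):
    """Generate one sample; sizes[0] is the coarsest scale N, sizes[-1] is scale 0."""
    height, width = sizes[0]
    x = psi(noise(height, width, rng))            # x_N = G_N(z_N), assumed form
    for height, width in sizes[1:]:
        up = upsample(x, height, width)           # (x_{n+1}) upsampled by r
        z = noise(height, width, rng)
        x = add(up, psi(add(z, up)))              # x_n = up + psi_n(z_n + up)
    return x
```

Because every operation is defined per pixel or per patch, changing the entries of `sizes` changes the output size and aspect ratio without touching the generators, which is the property the notes attribute to the fully convolutional design.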
2.2. Loss Function
$\min_{G_n} \max_{D_n} L_{adv}(G_n, D_n) + \alpha L_{rec}(G_n)$.
Reconstruction Loss
1) Ensure that there exists a specific set of input noise maps that generates the original image $x$:
$\lbrace z_N^{rec}, z_{N-1}^{rec}, \dots, z_0^{rec} \rbrace = \lbrace z^{*}, 0, \dots, 0 \rbrace$.
2) $z^{*}$ is a noise map that is drawn once and kept fixed during training.
3) $L_{rec} = || G_n(0, ( \tilde{x} _{n+1}^{rec} ) \uparrow^r) - x_n ||^2, n \lt N$.
4) $L_{rec} = || G_N(z^{*}) - x_N ||^2, n = N$
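The two cases of the reconstruction loss can be written out as a small sketch. The generators here are hypothetical callables passed in by the caller (`G_N(z)` and `G_n(z, upsampled)`), images are plain 2-D lists, and the norm is taken as the squared L2 norm (sum of squared differences); none of these names come from the paper.

```python
def squared_l2(a, b):
    """Squared L2 distance between two equally sized 2-D lists."""
    return sum((p - q) ** 2
               for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))

def rec_loss_coarsest(G_N, z_star, x_N):
    """Case n = N: the fixed noise map z* must reproduce the coarsest real image x_N."""
    return squared_l2(G_N(z_star), x_N)

def rec_loss(G_n, x_rec_upsampled, x_n):
    """Case n < N: zero noise plus the upsampled reconstruction from scale n+1."""
    zeros = [[0.0] * len(row) for row in x_n]   # z_n^{rec} = 0
    return squared_l2(G_n(zeros, x_rec_upsampled), x_n)
```

With a toy generator that simply adds its two inputs, `rec_loss` reduces to the distance between the upsampled reconstruction and $x_n$, which makes the role of the zero noise map easy to see.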
3. Experiments
3.1. Explore SinGAN
1) Starting the generation from a finer scale, by injecting a downsampled version of the original image there, keeps the global structure intact and varies only the finer details.
2) Train with different numbers of scales $N$.
3.2. Quantitative Evaluation
3.3. Application
3.3.1. Super-Resolution
1) Reconstruction loss weight $\alpha = 100$.
2) Scale factor $r = \sqrt[k]{s}, k \in \mathbb{N}$, where $s$ is the target upsampling factor.
3) Train on the LR image; at test time, upsample the LR image by $r$ and inject it into $G_0$. Repeating this $k$ times gives the overall factor $r^k = s$.
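A short sketch of the arithmetic behind the scale factor: given a target factor $s$ and a number of steps $k$, it computes $r = s^{1/k}$ and the image size after each upsampling step. It assumes the upsample-and-inject step is applied $k$ times so that $r^k = s$; the function name and the rounding of sizes are illustrative choices, not taken from the paper.

```python
def sr_schedule(height, width, s, k):
    """Return r = s**(1/k) and the (height, width) after each of the k upsampling steps."""
    r = s ** (1.0 / k)
    sizes = [(round(height * r ** i), round(width * r ** i))
             for i in range(1, k + 1)]
    return r, sizes

# Example: 4x super-resolution of a 32x48 image in 6 steps.
r, sizes = sr_schedule(32, 48, s=4, k=6)
```

Each intermediate size in `sizes` is the resolution at which the upsampled image would be passed through $G_0$ again.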
3.3.2. Paint-to-Image
1) Downsample the clipart image, then feed it into one of the coarse scales ($N-1$ or $N-2$).
3.3.3. Harmonization
1) Train on the background image, then inject a downsampled version of the naively pasted composite at test time.
3.3.4. Editing
1) Inject a downsampled version of the composite into one of the coarse scales.
2) Combine SinGAN’s output in the edited regions with the original image elsewhere.
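The combination step can be illustrated with a per-pixel mask blend. This is a minimal sketch assuming a mask with values in $[0, 1]$ (1 inside the edited region, 0 outside); the notes do not say how the regions are combined, so the linear blend and the function name are assumptions.

```python
def blend(original, generated, mask):
    """Take `generated` where mask is 1 and `original` where mask is 0 (linear in between)."""
    return [[m * g + (1.0 - m) * o
             for o, g, m in zip(row_o, row_g, row_m)]
            for row_o, row_g, row_m in zip(original, generated, mask)]
```

A soft mask (values between 0 and 1 near the region boundary) would give a gradual transition instead of a hard seam.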
3.3.5. Single Image Animation
1) A random walk in z-space, starting with $z^{rec}$ for the first frame at all generation scales.
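A random walk of this kind can be sketched as follows for a single scale, with the noise map flattened to a list. The plain Gaussian step and the `step` size are assumptions made for illustration and are not the paper's exact update rule; in practice one such walk would be run per generation scale.

```python
import random

def random_walk(z_rec, n_frames, step=0.1, rng=None):
    """Return n_frames noise vectors; the first is z_rec, each next one adds Gaussian noise."""
    if rng is None:
        rng = random.Random()
    frames = [list(z_rec)]                      # first frame starts at z^{rec}
    for _ in range(n_frames - 1):
        prev = frames[-1]
        frames.append([v + step * rng.gauss(0.0, 1.0) for v in prev])
    return frames
```

Feeding each frame's noise through the generators then yields one image per frame; a smaller `step` gives smoother motion between consecutive frames.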